A Short Intro to Web Scraping (in R)

Johannes B. Gruber

2024-05-02

Introduction

The Plan for Today

  • Learn what Web Scraping is
  • Get an understanding of the web
  • Learn how to identify patterns you can use for scraping
  • Get an overview of relevant tools
  • Learn about legal and ethical concerns (and myths)

Louis Hansel via unsplash.com

Who am I?

  • PostDoc at Department of Language, Literature and Communication at Vrije Universiteit Amsterdam and University of Amsterdam
  • Interested in:
    • Computational Social Science
    • Automated Text Analysis
    • Hybrid Media Systems and Information Flows
    • Protest and Democracy
  • Experience:
    • R user for 9 years
    • R package developer for 7 years
    • Worked on several packages for text analysis, API access and web scraping (quanteda.textmodels, LexisNexisTools, paperboy, traktok, rollama, amcat4-r, and more)

Who are you?

  • What is your name?
  • What are your research interests?
  • What is your experience with:
    • R
    • HTML
    • web scraping
  • Why are you taking this course?
  • Do you have specific plans that include web scraping?
  • What operating system are you using?

What is Web Scraping & should You Learn/Use it?

What is Web Scraping

  • Used when other means are unavailable
  • Scrape the (unstructured) Data
  • A web-scraper is a program (or robot) that:
    • goes to a web page
    • downloads its content
    • extracts data from the content
    • then saves the data to a file or a database
  • Unfortunately, there is no one-size-fits-all solution
    • Lots of different techniques, tools, tricks
    • Websites change (some more frequently than others)
    • Some websites make it hard for you (by accident or on purpose!)

Image Source: daveberesford.co.uk

Web Scraping: A Three-Step Process

  1. Send an HTTP request to the webpage -> server responds to the request by returning (HTML) content
  2. Parse the HTML content -> extract the information you want from the nested structure of (HTML) code
  3. Wrangle the data into a useful format

Original Image Source: prowebscraper.com
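The three steps can be sketched with rvest. This is a minimal, made-up example: an inline HTML snippet stands in for a live URL, since read_html() accepts both.

```r
library(rvest)

# 1. Request & collect: read_html() fetches and parses a page
#    (here an inline HTML string instead of a URL)
html <- read_html("<html><body><h1>Title</h1><p>Some data.</p></body></html>")

# 2. Parse: select elements and extract their content
paragraphs <- html |>
  html_elements("p") |>
  html_text2()

# 3. Wrangle: put the extracted pieces into a tidy structure
data.frame(text = paragraphs)
```

The same three calls reappear in every scraper below; only the URL and the selectors change.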

Why Should You Learn Web Scraping?

  • The internet is a data gold mine!
  • Data were not created for research, but are often traces of what people are actually doing on the internet
  • Reproducible and renewable data collection (e.g., rehydrate data that is copyrighted)
  • Web Scraping lets you automate data retrieval (as opposed to tedious copy & paste from a website)
  • It’s one of the most fun ways to learn R and programming!
    • It’s engaging and satisfying to find repeating patterns that you can employ to structure data (every website becomes a little puzzle)
    • It touches on many important computational skills
    • The return is good data to further your career (unlike sudokus or video games)

What are HTML and CSS

What is HTML

  • HTML (HyperText Markup Language) is the standard markup language for documents designed to be displayed in a web browser
  • Contains the raw data (text, URLs to pictures and videos) plus defines the layout and some of the styling of text
  • Each HTML element consists of a start tag (which can carry attributes), content, and an end tag:

<p class="paragraph">This is a paragraph.</p>

    Start tag: <p class="paragraph">  (attribute name: class, attribute value: "paragraph")
    Content:   This is a paragraph.
    End tag:   </p>

Image Source: Wikipedia.org

Example: Simple

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <p>This is the body of the text.</p>
</body>
</html>

Browser View:

Example: With headline and author

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author"><a href="https://www.johannesbgruber.eu/">Me</a></p>
    <p>This is the body of the text.</p>
</body>
</html>

Browser View:

Example: With some data

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author">Me</p>
    <p>This is the body of the text.</p>
    <p>Consider this data:</p>
    <table>
        <tr>
            <th>Name</th>
            <th>Age</th>
        </tr>
        <tr>
            <td>John</td>
            <td>25</td>
        </tr>
        <tr>
            <td>Mary</td>
            <td>26</td>
        </tr>
    </table>
</body>
</html>

Browser View:

Example: With an image

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author">Me</p>
    <p>This is the body of the text.</p>
    <p>Consider this image:</p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/About_The_Dog.jpg/640px-About_The_Dog.jpg" alt="About The Dog.">
</body>
</html>

Browser View:

What is CSS

  • CSS (Cascading Style Sheets) is very often used in addition to HTML to control the presentation of a document
  • Designed to enable the separation of content from presentation, including layout, colours, and fonts
  • The reason it is interesting for web scraping is that pieces of information of the same kind often get the same styling

Example: CSS

HTML:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
    <link rel="stylesheet" type="text/css" href="example.css">
</head>
<body>
  <h1 class="headline">My Headline</h1>
  <p class="author">Me</p>
  <div class="content">
    <p>This is the body of the text.</p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/About_The_Dog.jpg/640px-About_The_Dog.jpg" alt="About The Dog.">
    <p>Consider this data:</p>
    <table>
      <tr class="top-row">
          <th>Name</th>
          <th>Age</th>
      </tr>
      <tr>
          <td>John</td>
          <td>25</td>
      </tr>
      <tr>
          <td>Mary</td>
          <td>26</td>
      </tr>
    </table>
  </div>
</body>
</html>

CSS:

/* CSS file */

.headline {
  color: red;
}

.author {
  color: grey;
  font-style: italic;
  font-weight: bold;
}

.top-row {
  background-color: lightgrey;
}

.content img {
  border: 2px solid black;
}

table, th, td {
  border: 1px solid black;
}

Browser View:

Exercises: HTML and CSS

  1. Add another paragraph to data/example.html and display it in your browser
  2. Add a second-level headline to the page
  3. Add another image to the page

HTML and CSS in Web Scraping

Using HTML tags:

You can select HTML elements by their tags

library(rvest)
read_html("data/example.html") |>  # retrieve content
  html_elements("p") |>            # select content via css selector
  html_text2()                     # extract data you want
[1] "Me"                            "This is the body of the text."
[3] "Consider this image:"          "Consider this data:"          
  • to select them, tags are written without the <>
  • in theory, arbitrary tags are possible, but commonly people use <p> (paragraph), <br> (line break), <h1>, <h2>, <h3>, … (first, second, third, … level headline), <b> (bold), <i> (italic), <img> (image), <a> (hyperlink), and a couple more.

Using attributes

You can select elements by an attribute, including the class:

read_html("data/example.html") |> 
  html_element("[class=\"headline\"]") |> 
  html_text()
[1] "My Headline"

For class, there is also a shorthand:

read_html("data/example.html") |> 
  html_element(".headline") |> 
  html_text()
[1] "My Headline"

Another important shorthand is #, which selects the id attribute:

read_html("data/example.html") |> 
  html_element("#table-1") |> 
  html_table()                     # html_table tries to re-assemble tables 
# A tibble: 2 × 2
  Name    Age
  <chr> <int>
1 John     25
2 Mary     26

Extracting attributes

Instead of selecting by attribute, you can also extract one or all attributes:

read_html("data/example.html") |> 
  html_elements("a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"   "https://en.wikipedia.org/wiki/Dog"
read_html("data/example.html") |> 
  html_elements("a") |> 
  html_attrs()
[[1]]
                             href 
"https://www.johannesbgruber.eu/" 

[[2]]
                               href 
"https://en.wikipedia.org/wiki/Dog" 

Chaining selectors

If there is more than one element that fits your selector, but you only want one of them, see if you can make your selection more specific by chaining selectors with > (for direct children) or an empty space (for any descendants of an element):

read_html("data/example.html") |> 
  html_elements(".author>a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"
read_html("data/example.html") |> 
  html_elements(".author a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"

Tip: there is also no rule against doing this instead:

read_html("data/example.html") |> 
  html_elements(".author") |> 
  html_elements("a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"

Common Selectors

There are quite a lot of CSS selectors, but often you can stick to just a few:

selector example Selects
element/tag table all <table> elements
class .someTable all elements with class="someTable"
id #table-1 unique element with id="table-1"
element.class tr.headerRow all <tr> elements with the headerRow class
class1.class2 .someTable.blue all elements with both the someTable AND the blue class
class1 > tag .table-1 > tr all <tr> elements that are direct children of .table-1
class1 + tag .top-row + tr the first <tr> element immediately following .top-row
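A quick way to convince yourself of what these selectors do is to try them on a small made-up snippet (note that R's HTML parser wraps bare table rows in a <tbody>, which the selectors below still match):

```r
library(rvest)

page <- read_html('
<table id="table-1" class="someTable">
  <tr class="top-row"><th>Name</th><th>Age</th></tr>
  <tr><td>John</td><td>25</td></tr>
</table>')

# id selector: matches the one element with id="table-1"
page |> html_elements("#table-1") |> length()

# adjacent sibling: the <tr> immediately following the header row
page |> html_elements(".top-row + tr") |> html_text2()
```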

Family Relations

Each html tag can contain other tags. To keep track of the relations we speak of ancestors, descendants, parents, children and siblings.

<book>
  <chapter>
    <section>
      <subsection>
        This is a subsection.
      </subsection>
      <subsection>
        This is another subsection.
      </subsection>
    </section>
    <section>
      This is a section.
    </section>
  </chapter>
  <chapter>
    <section>
      This is a section.
    </section>
    <section>
      This is a section.
    </section>
  </chapter>
  <chapter>
    This is a chapter without sections.
  </chapter>
</book>
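These relations map directly onto selectors. A sketch using a shortened version of the snippet above (tags like <book> are not real HTML, so we parse the snippet as XML with xml2; the CSS selectors work the same way):

```r
library(rvest)
library(xml2)

book <- read_xml('<book>
  <chapter>
    <section><subsection>This is a subsection.</subsection></section>
    <section>This is a section.</section>
  </chapter>
  <chapter><section>This is a section.</section></chapter>
</book>')

# children: <section> elements whose direct parent is a <chapter> (3 matches)
book |> html_elements("chapter > section") |> length()

# descendants: <subsection> elements anywhere below <book>
book |> html_elements("book subsection") |> html_text()
```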

Exercises: HTML selectors

  1. Practice finding the right selector with the CSS Diner game (https://flukeout.github.io/)
  2. Consider the toy HTML example below. Which selectors do you need to put into html_elements() (which extracts all elements matching the selector) to extract each piece of information (headline, author, keywords, and the three pieces of advice)?
library(rvest)
webpage <- "<html>
<body>
  <h1>Computational Research in the Post-API Age</h1>
  <div class='author'>Deen Freelon</div>
  <div>Keywords:
    <ul>
      <li>API</li>
      <li>computational</li>
      <li>Facebook</li>
    </ul>
  </div>
  <div class='text'>
    <p>Three pieces of advice on whether and how to scrape from Dan Freelon</p>
  </div>
  
  <ol class='advice'>
    <li id='one'> use authorized methods whenever possible </li>
    <li id='two'> do not confuse terms of service compliance with data protection </li>
    <li id='three'> understand the risks of violating terms of service </li>
  </ol>

</body>
</html>" |> 
  read_html()

Scraping Static Web Pages

Example: World Happiness Report

Use your Browser to Scout

Use your Browser’s Inspect tool

Note: might not be available in all browsers; use a Chromium-based browser or Firefox.

Use rvest to scrape

library(rvest)
library(tidyverse)

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=World_Happiness_Report&oldid=1165407285")

# 2. Parse
happy_table <- html |> 
  html_elements(".wikitable") |> # select the right element
  html_table() |>                # special function for tables
  pluck(3)                       # select the third table

# 3. No wrangling necessary
happy_table
# A tibble: 153 × 9
   `Overall rank` `Country or region` Score `GDP per capita` `Social support`
            <int> <chr>               <dbl>            <dbl>            <dbl>
 1              1 Finland              7.81             1.28             1.5 
 2              2 Denmark              7.65             1.33             1.50
 3              3 Switzerland          7.56             1.39             1.47
 4              4 Iceland              7.50             1.33             1.55
 5              5 Norway               7.49             1.42             1.50
 6              6 Netherlands          7.45             1.34             1.46
 7              7 Sweden               7.35             1.32             1.43
 8              8 New Zealand          7.3              1.24             1.49
 9              9 Austria              7.29             1.32             1.44
10             10 Luxembourg           7.24             1.54             1.39
# ℹ 143 more rows
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
#   `Freedom to make life choices` <dbl>, Generosity <dbl>,
#   `Perceptions of corruption` <dbl>
## Plot relationship between wealth and life expectancy
ggplot(happy_table, aes(x = `GDP per capita`, y = `Healthy life expectancy`)) + 
  geom_point() + 
  geom_smooth(method = 'lm')

Exercises: Static Web Pages 1

  1. Get the table with 2023 opinion polling for the next United Kingdom general election from https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_United_Kingdom_general_election
  2. Wrangle and plot the opinion poll data

Example: UK prime ministers on Wikipedia

Use your Browser to Scout

Use rvest to scrape

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=List_of_prime_ministers_of_the_United_Kingdom&oldid=1166167337") # using an older version of the page since it was recently restructured

# 2. Parse
pm_table <- html |> 
  html_element(".wikitable:contains('List of prime ministers')") |>
  html_table() |> 
  as_tibble(.name_repair = "unique") |> 
  filter(!duplicated(`Prime ministerOffice(Lifespan)`))

# 3. No wrangling necessary
pm_table
# A tibble: 75 × 11
   Portrait...1 Portrait...2 Prime ministerOffice(Lifespa…¹ `Term of office...4`
   <chr>        <chr>        <chr>                          <chr>               
 1 "Portrait"   "Portrait"   Prime ministerOffice(Lifespan) start               
 2 "​"           ""           Robert Walpole[27]MP for King… 3 April1721         
 3 "​"           ""           Spencer Compton[28]1st Earl o… 16 February1742     
 4 "​"           ""           Henry Pelham[29]MP for Sussex… 27 August1743       
 5 "​"           ""           Thomas Pelham-Holles[30]1st D… 16 March1754        
 6 "​"           ""           William Cavendish[31]4th Duke… 16 November1756     
 7 "​"           ""           Thomas Pelham-Holles[32]1st D… 29 June1757         
 8 ""           ""           John Stuart[33]3rd Earl of Bu… 26 May1762          
 9 ""           ""           George Grenville[34]MP for Bu… 16 April1763        
10 ""           ""           Charles Watson-Wentworth[35]2… 13 July1765         
# ℹ 65 more rows
# ℹ abbreviated name: ¹​`Prime ministerOffice(Lifespan)`
# ℹ 7 more variables: `Term of office...5` <chr>, `Term of office...6` <chr>,
#   `Mandate[a]` <chr>, `Ministerial offices held as prime minister` <chr>,
#   Party <chr>, Government <chr>, MonarchReign <chr>
Looking at the raw HTML of one table cell shows that the names and links are wrapped in <b> and <a> tags:

<td rowspan="4">
  <span class="anchor" id="18th_century"></span>
   <b>
     <a href="/wiki/Robert_Walpole" title="Robert Walpole">Robert Walpole</a>
   </b>
   <sup id="cite_ref-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46_28-0" class="reference">
     <a href="#cite_note-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46-28">[27]</a>
   </sup>
   <br>
   <span style="font-size:85%;">MP for <a href="/wiki/King%27s_Lynn_(UK_Parliament_constituency)" title="King's Lynn (UK Parliament constituency)">King's Lynn</a>
   <br>(1676–1745)
  </span>
</td>
links <- html |> 
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_attr("href")
title <- html |> 
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_text()
tibble(name = title, link = links)
# A tibble: 90 × 2
   name                 link                                             
   <chr>                <chr>                                            
 1 Robert Walpole       /wiki/Robert_Walpole                             
 2 George I             /wiki/George_I_of_Great_Britain                  
 3 George II            /wiki/George_II_of_Great_Britain                 
 4 Spencer Compton      /wiki/Spencer_Compton,_1st_Earl_of_Wilmington    
 5 Henry Pelham         /wiki/Henry_Pelham                               
 6 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
 7 William Cavendish    /wiki/William_Cavendish,_4th_Duke_of_Devonshire  
 8 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
 9 George III           /wiki/George_III                                 
10 John Stuart          /wiki/John_Stuart,_3rd_Earl_of_Bute              
# ℹ 80 more rows

Note: these are relative links that need to be combined with https://en.wikipedia.org/ to work
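One way to do that combination is xml2's url_absolute(), which resolves relative links against a base URL (a plain paste0() works too); a small sketch:

```r
library(xml2)  # rvest builds on xml2, which provides url_absolute()

links <- c("/wiki/Robert_Walpole", "/wiki/Henry_Pelham")
url_absolute(links, "https://en.wikipedia.org/")
# "https://en.wikipedia.org/wiki/Robert_Walpole" "https://en.wikipedia.org/wiki/Henry_Pelham"
```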

Exercises: Static Web Pages 2

  1. For extracting text, rvest has two functions: html_text and html_text2. Explain the difference. You can test your explanation with the example html below.
html <- "<p>This is some text
         some more text</p><p>A new paragraph!</p>
         <p>Quick Question, is web scraping:

         a) fun
         b) tedious
         c) I'm not sure yet!</p>" |> 
  read_html()
  2. How could you convert the links object so that it contains actual URLs?
  3. How could you add the links we extracted above to the pm_table to keep everything together?

Other Techniques

APIs

  • An Application Programming Interface (API) is a way for two computer programs to speak to each other
  • Commonly used to distribute data or do many other things (e.g., the defunct Twitter and Facebook APIs, NYT and Guardian APIs, MediaCloud API)
  • Good way to access APIs: httr2

API example: Guardian API

If you want to follow along:

library(httr2)
library(tidyverse, warn.conflicts = FALSE)
req <- request("https://content.guardianapis.com") |>  # start the request with the base URL
  req_url_path("search") |>                            # navigate to the endpoint you want to access
  req_method("GET") |>                                 # specify the method
  req_timeout(seconds = 60) |>                         # how long to wait for a response
  req_headers("User-Agent" = "httr2 guardian test") |> # specify request headers
  # req_body_json() |>                                 # since this is a GET request the body stays empty
  req_url_query(                                       # instead the query is added to the URL
    q = "parliament AND debate",
    "show-blocks" = "all"
  ) |>
  req_url_query(                                       # in this case, the API key is also added to the query
    "api-key" = "d187828f-9c6a-4c29-afd4-dbd43e116965"             # but httr2 also has req_auth_* functions for other
  )                                                    # authentication procedures
print(req)

Nothing is done until you perform the request:

resp <- req |> 
  req_perform()

Then you need to parse the response:

parse_response <- function(resp) {
  # make sure response is valid
  if (resp_content_type(resp) != "application/json") {
    stop("Request was not successful!")
  }
  
  # extract articles
  results <- resp_body_json(resp) |> 
    pluck("response", "results")
  
  # parse into data.frame
  map(results, function(res) {
    tibble(
      id = res$id,
      type = res$type,
      time = lubridate::ymd_hms(res$webPublicationDate),
      headline = res$webTitle,
      text = rvest::read_html(pluck(res, "blocks", "body", 1, "bodyHtml")) |> rvest::html_text2()
    )
  }) |> 
    bind_rows()
  
}
parse_response(resp)
# A tibble: 10 × 5
   id                                   type  time                headline text 
   <chr>                                <chr> <dttm>              <chr>    <chr>
 1 australia-news/2023/nov/15/peter-du… arti… 2023-11-15 07:19:09 "Peter … "Ant…
 2 uk-news/2024/mar/22/jersey-debate-a… arti… 2024-03-22 14:45:27 "Jersey… "Jer…
 3 australia-news/2024/mar/25/labor-al… arti… 2024-03-25 08:08:13 "Labor … "The…
 4 world/2024/apr/11/poland-mps-debate… arti… 2024-04-11 13:58:50 "Polish… "Pol…
 5 australia-news/2024/jan/17/lord-pra… arti… 2024-01-16 14:00:03 "Should… "For…
 6 australia-news/live/2023/nov/30/pol… live… 2023-11-30 07:02:24 "Immigr… "Tha…
 7 society/2024/apr/18/hilary-cass-rep… arti… 2024-04-18 17:06:24 "Hilary… "In …
 8 australia-news/commentisfree/2024/m… arti… 2024-03-29 23:00:03 "The we… "Bla…
 9 australia-news/2023/nov/18/the-week… arti… 2023-11-17 23:00:04 "The we… "Ano…
10 world/live/2024/apr/29/europe-live-… live… 2024-04-29 19:15:38 "Europe… "Eig…

Exercises APIs

This call retrieves the first 10 articles that match the query:

library(httr2)
response <- request("https://content.guardianapis.com") |>  # start the request with the base URL
  req_url_path("search") |>                            # navigate to the endpoint you want to access
  req_method("GET") |>                                 # specify the method
  req_timeout(seconds = 60) |>                         # how long to wait for a response
  req_headers("User-Agent" = "httr2 guardian test") |> # specify request headers
  # req_body_json() |>                                 # since this is a GET request the body stays empty
  req_url_query(                                       # instead the query is added to the URL
    q = "parliament AND debate",
    "show-blocks" = "all"
  ) |>
  req_url_query(                                       # in this case, the API key is also added to the query
    "api-key" = "d187828f-9c6a-4c29-afd4-dbd43e116965" # but httr2 also has req_auth_* functions for other
  ) |> 
  req_perform() |> 
  resp_body_json()

View(response$response)

  1. How can you change the call to get 25 articles instead?
  2. The call delivers the first page. How can you get to the second one?

Special Requests

  • Some websites limit requests
  • When you run read_html from rvest, it uses a default request that fits most of the time, but not always:
html <- read_html("https://www.icahdq.org/mpage/ICA23-Program")
Error in open.connection(x, "rb"): HTTP error 403.

To interpret HTTP errors, you can use this handy function:

error_cat <- function(error) {
  link <- paste0("https://http.cat/images/", error, ".jpg")
  knitr::include_graphics(link)
}
error_cat(403)

So what to do next?

  • Scope the Network tab
  • Translate Curl
  • Build request in R

Translate the cURL call

curl_translate("curl 'https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Referer: https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  -H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: \"Linux\"' \
  --compressed")
request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/") |> 
  req_url_query(
    event_id = "JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4=",
  ) |> 
  req_headers(
    Accept = "application/json, text/plain, */*",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) |> 
  req_perform()

Make request in R

ica_programme_data <- request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D") |> 
  req_headers(
    Accept = "application/json, text/plain, */*",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    Connection = "keep-alive",
    Pragma = "no-cache",
    Referer = "https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/",
    `Sec-Fetch-Dest` = "empty",
    `Sec-Fetch-Mode` = "cors",
    `Sec-Fetch-Site` = "same-origin",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    `sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
    `sec-ch-ua-mobile` = "?0",
    `sec-ch-ua-platform` = "\"Linux\"",
  ) |> 
  req_perform() |> 
  resp_body_json()
object.size(ica_programme_data) |> 
  format("MB")
[1] "9.5 Mb"

It worked!

Special Requests: Behind Paywall

Let’s get this cool data journalism article.

html <- read_html("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende")
html |> 
  html_elements(".article-body p") |> 
  html_text2()
[1] "Ganz Deutschland fährt Bahn. So fühlte sich das im Sommer 2022 zumindest an, als das 9-Euro-Ticket für drei Monate für überfüllte Züge sorgte. Die Bundesregierung und viele Menschen zeigten sich begeistert: So leicht war es also, Bürgerinnen und Bürger für die umweltfreundlichen öffentlichen Verkehrsmittel zu begeistern, man muss nur ein günstiges Ticket für ganz Deutschland anbieten."
[2] "Aber als die Bundesregierung den Nachfolger vorstellte, waren viele enttäuscht. 49 Euro monatlich kostet das Deutschlandticket und ist nur im Abo erhältlich. Euphorisch war nur noch die Bundesregierung. Doch jetzt, ein Jahr nach dem Start, kann man sagen: zu Recht. Zumindest, was die Fahrgastzahlen angeht."                                                                                

:thinking: Wait, that’s only the first two paragraphs!

Special Requests: Behind Paywall Cookies!

library(cookiemonster)
add_cookies("cookies.txt")
html <- request("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende") |> # start a request
  req_options(cookie = get_cookies("zeit.de", as = "string")) |> # add cookies to be sent with it
  req_perform() |> 
  resp_body_html() # extract html from response

html |> 
  html_elements(".article-body p") |> 
  html_text2()
 [1] "Ganz Deutschland fährt Bahn. So fühlte sich das im Sommer 2022 zumindest an, als das 9-Euro-Ticket für drei Monate für überfüllte Züge sorgte. Die Bundesregierung und viele Menschen zeigten sich begeistert: So leicht war es also, Bürgerinnen und Bürger für die umweltfreundlichen öffentlichen Verkehrsmittel zu begeistern, man muss nur ein günstiges Ticket für ganz Deutschland anbieten."                                                                                                                                                                                                                           
 [2] "Aber als die Bundesregierung den Nachfolger vorstellte, waren viele enttäuscht. 49 Euro monatlich kostet das Deutschlandticket und ist nur im Abo erhältlich. Euphorisch war nur noch die Bundesregierung. Doch jetzt, ein Jahr nach dem Start, kann man sagen: zu Recht. Zumindest, was die Fahrgastzahlen angeht."                                                                                                                                                                                                                                                                                                           
 [3] "Zwar hat fast jeder zweite das 49-Euro-Ticket schon mindestens einmal gekündigt (PDF). Aber im Schnitt haben jeden Monat 11,2 Millionen Menschen eins. Die Aboquote im ÖPNV hat sich laut Verband Deutscher Verkehrsunternehmen (VDV) um 50 Prozent erhöht (PDF). Bei einer Befragung gab jeder Vierte an, dass er diese Bus- oder Bahnfahrt ohne ein Deutschlandticket nicht gemacht hätte."                                                                                                                                                                                                                                  
 [4] "Dieser Zuspruch zeigt sich in den Fahrgastzahlen. Die nähern sich denen aus dem 9-Euro-Sommer an, zumindest was Fahrten über 30 Kilometer angeht. Das zeigen Daten, die die Firma Teralytics für ZEIT ONLINE ausgewertet hat. Teralytics nutzt Mobilfunkdaten des Konzerns Telefónica, zu dem unter anderem O2 gehört. Anhand der Geschwindigkeit, mit der sich die Handys zwischen Funkzellen bewegen, kann Teralytics das genutzte Verkehrsmittel zuordnen."                                                                                                                                                                 
 [5] "Allerdings zeigen die Daten nur Zugfahrten mit mehr als 30 Kilometern, da die Zuordnung erst dann verlässlich ist. Bus- oder U-Bahn-Fahrten sind also nicht erfasst. Fahrten mit dem ICE oder IC, die mit dem Deutschlandticket nicht möglich sind, sind dagegen enthalten."                                                                                                                                                                                                                                                                                                                                                   
 [6] "Der Anstieg der Fahrten ab Einführung des 49-Euro-Tickets und die Parallelen zur Entwicklung beim 9-Euro-Ticket sprechen jedoch dafür, dass auch jetzt wieder viele der neuen Fahrten auf Regionalbahnen zurückzuführen sind. Passend dazu meldet die DB Regio 28 Prozent mehr Fahrgäste. Die Hälfte der Fahrten seien dabei privat veranlasst. Das zeigt: Das 49-Euro-Ticket wird bei Weitem nicht nur in der Freizeit genutzt."                                                                                                                                                                                              
 [7] "Und auch Nahverkehrsunternehmen berichten, dass die Fahrgastzahlen, die während Corona eingebrochen waren, fast wieder das alte Niveau erreicht haben und an den Wochenenden sogar überschreiten. Wer ein Deutschlandticket erwirbt, legt laut VDV im Anschluss monatlich im Schnitt 16 Kilometer mehr mit dem ÖPNV zurück als zuvor."                                                                                                                                                                                                                                                                                         
 [8] "Das neue Angebot hat es also geschafft, den Nahverkehr deutlich beliebter zu machen. Aber trägt es auch zur Verkehrswende bei? Nordrhein-Westfalens Verkehrsminister Oliver Krischer (Grüne) sagte vergangene Woche, dass das Ticket einen wichtigen Beitrag zum Klimaschutz leiste."                                                                                                                                                                                                                                                                                                                                          
 [9] "Doch dafür ist nicht entscheidend, ob die Menschen mehr Bus und Bahn fahren, sondern ob das Auto häufiger stehen bleibt. Befragungen deuten darauf hin. Laut einer Umfrage des Fraunhofer-Instituts für System- und Innovationsforschung, die ZEIT ONLINE vor Veröffentlichung einsehen konnte, ist nicht nur der Anteil der Wege, die mit dem ÖPNV zurückgelegt werden, bei den Deutschlandticket-Inhabern um neun Prozentpunkte gestiegen, sondern auch die Autonutzung um fünf Prozentpunkte zurückgegangen. Und laut der aktuellen VDV-Befragung wird das Auto 16 Prozent seltener genutzt."                               
[10] "In den Teralytics-Daten, die auch Autofahrten über 30 Kilometern erfassen, lässt sich das jedoch nicht erkennen. Sie haben im vergangenen Jahr sogar zugenommen."                                                                                                                                                                                                                                                                                                                                                                                                                                                              
[11] "Und Daten, die der Navigationsanbieter TomTom für ZEIT ONLINE ausgewertet hat, zeigen, dass auch der Stau in den Städten unaufhörlich schlimmer wird."                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
[12] "Wie kann das sein? Eine Erklärung wäre, dass der Autoverkehr auch unabhängig vom Deutschlandticket zunimmt. Dass also die, die kein Ticket haben, mehr Auto fahren. Etwa, weil Unternehmen ihre Mitarbeitenden wieder öfter ins Büro holen und wieder mehr Dienst- und Urlaubsreisen stattfinden als während der Pandemie. Es könnte aber auch sein, dass das, was Befragte in Umfragen angeben, nicht immer der Realität entspricht. Dass Menschen überschätzen, wie sehr das Ticket ihre Autofahrten reduziert hat. Oder sie geben die Antwort, die gesellschaftlich erwünscht ist: Na klar reduziere ich meine Autofahrten."
[13] "Ganz Deutschland fährt Bahn. So fühlte sich das im Sommer 2022 zumindest an, als das 9-Euro-Ticket für drei Monate für überfüllte Züge sorgte. Die Bundesregierung und viele Menschen zeigten sich begeistert: So leicht war es also, Bürgerinnen und Bürger für die umweltfreundlichen öffentlichen Verkehrsmittel zu begeistern, man muss nur ein günstiges Ticket für ganz Deutschland anbieten."                                                                                                                                                                                                                           
[14] "Aber als die Bundesregierung den Nachfolger vorstellte, waren viele enttäuscht. 49 Euro monatlich kostet das Deutschlandticket und ist nur im Abo erhältlich. Euphorisch war nur noch die Bundesregierung. Doch jetzt, ein Jahr nach dem Start, kann man sagen: zu Recht. Zumindest, was die Fahrgastzahlen angeht."                                                                                                                                                                                                                                                                                                           

Interactive Website

static <- read_html("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")
# the journey times are rendered by JavaScript, so the static
# HTML response does not contain them
static |> 
  html_elements(".MespJc") |> 
  html_text2()
character(0)

google maps commute

Interactive Website & Browser Automation

  • The new read_html_live() function from rvest solves this by controlling a real (headless) browser:
# loads a real web browser
sess <- read_html_live("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")

# you can even take a look at what is happening with
# sess$view()
# cookies <- sess$session$Network$getCookies()
# saveRDS(cookies, "data/chromote_cookies.rds")
cookies <- readRDS("data/chromote_cookies.rds")
sess$session$Network$setCookies(cookies = cookies$cookies)
named list()
# the session behaves like a normal rvest html object
sess |> 
  html_elements(".MespJc") |> 
  html_text2() |> 
  str_extract(".+?min")
character(0)
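The live session is not limited to reading: the LiveHTML object returned by read_html_live() also exposes methods to interact with the page, which helps when content only appears after scrolling, clicking, or typing. A minimal sketch; the URL and CSS selectors here are hypothetical placeholders, not taken from the Google Maps example above:

```r
library(rvest)

sess <- read_html_live("https://example.com")  # placeholder URL

# scroll to trigger lazy loading, then click and type
# (".load-more" and "#search" are hypothetical selectors)
sess$scroll_into_view(".load-more")
sess$click(".load-more")
sess$type("#search", "Glasgow")

# afterwards, query the updated DOM as usual
sess |>
  html_elements("h2") |>
  html_text2()
```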

Some of my other packages that can make your life easier

paperboy: get data from news media sites

paperboy::pb_deliver("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende",
                     use_cookies = TRUE)
# A tibble: 1 × 9
  url       expanded_url domain status datetime            author headline text 
  <chr>     <chr>        <chr>   <int> <dttm>              <chr>  <chr>    <chr>
1 https://… https://www… zeit.…    200 2024-05-01 04:40:02 https… Deutsch… "Z+M…
# ℹ 1 more variable: misc <list>

traktok: easy access to TikTok data

library(traktok)
df <- tt_search_hidden("#rstats", max_pages = 2)
df
# A tibble: 16 × 20
   video_id            video_timestamp     video_url    video_length video_title
   <chr>               <dttm>              <glue>              <int> <chr>      
 1 7252226153828584731 2023-07-05 07:01:45 https://www…           36 "Wow!!! TH…
 2 7350107863165046034 2024-03-25 01:32:34 https://www…          263 "Analysis …
 3 7351237579729145096 2024-03-28 02:36:25 https://www…          202 "Simple Li…
 4 7150834646039727402 2022-10-05 01:30:42 https://www…           47 "Replying …
 5 7141887653603020037 2022-09-10 22:51:46 https://www…           18 "Working w…
 6 6866476393312603397 2020-08-29 18:35:25 https://www…           42 "Why you s…
 7 7312089105867885857 2023-12-13 14:40:20 https://www…           47 "Quick R Q…
 8 7323699816263863585 2024-01-13 21:35:47 https://www…           68 "Fc 24 aze…
 9 6867247174250401029 2020-08-31 20:26:29 https://www…           57 "Navigatin…
10 7156952318175497514 2022-10-21 13:10:48 https://www…           48 "#NBA #Bas…
11 7163734204994293038 2022-11-08 19:47:38 https://www…            7 "So frustr…
12 7348883620708338977 2024-03-21 18:21:50 https://www…           15 "summer be…
13 7291292897930939681 2023-10-18 13:40:23 https://www…           50 "Quick R Q…
14 7289581166955416864 2023-10-13 22:58:07 https://www…           59 "Quick R Q…
15 7272223489384254753 2023-08-28 04:22:05 https://www…           46 "Quick R Q…
16 7254926524996947227 2023-07-12 13:40:20 https://www…           27 "How to Co…
# ℹ 15 more variables: video_diggcount <int>, video_sharecount <int>,
#   video_commentcount <int>, video_playcount <int>, video_is_ad <lgl>,
#   author_name <chr>, author_nickname <chr>, author_followercount <int>,
#   author_followingcount <int>, author_heartcount <int>,
#   author_videocount <int>, author_diggcount <int>, music <list>,
#   challenges <list>, download_url <chr>
tt_videos_hidden(df$video_url[1])
# A tibble: 1 × 25
  video_id            video_url     video_timestamp     video_length video_title
  <glue>              <chr>         <dttm>                     <int> <chr>      
1 7252226153828584731 https://www.… 2023-07-05 07:01:45           36 Wow!!! THI…
# ℹ 20 more variables: video_locationcreated <chr>, video_diggcount <int>,
#   video_sharecount <int>, video_commentcount <int>, video_playcount <int>,
#   author_id <chr>, author_secuid <chr>, author_username <chr>,
#   author_nickname <chr>, author_bio <chr>, download_url <chr>,
#   html_status <int>, music <list>, challenges <list>, is_secret <lgl>,
#   is_for_friend <lgl>, is_slides <lgl>, video_status <chr>,
#   video_status_code <int>, video_fn <chr>

Shameless self-promotion

ESS

https://essexsummerschool.com/summer-school-facts/courses/ess-2024-course-list/2v-introduction-to-web-scraping-and-data-management-for-social-scientists/

Should you use Web Scraping?

Are you Allowed to use Web Scraping?

Web scraping is not a shady or illegal activity per se, but not all web scraping is unproblematic, and scraping data does not make it yours.

  • Collecting personal data of people in the EU might violate GDPR (General Data Protection Regulation)
    • The GDPR defines personal data as “any information relating to an identified or identifiable natural person.” (Art. 4 GDPR)
    • Exceptions
      • if you get consent from the people whose data it is
      • personal data processing is legitimate when “necessary for the performance of a task carried out in the public interest” (Art. 6 GDPR)
  • Collecting copyrighted data
    • Complicated legal situation
    • Public-facing content is probably okay (9th Circuit ruling)
    • “there have been no lawsuits in […] major western democratic countries stemming from a researcher scraping publicly accessible data from a website for personal or academic use.” (Luscombe, Dick, and Walby 2022)
    • You will probably get in trouble if you distribute the material
  • Honouring Terms of Service and robots.txt
    • Many companies have ToS that might prohibit you from scraping (these are not laws, might not be binding and whether they can be enforced is a separate question)
    • /robots.txt is often where guidelines are communicated to automated crawlers

ToS and Robots.txt

Twitter ToS

User-agent: *                         # the rules apply to all user agents
Disallow: /EPiServer/CMS/             # do not crawl any URLs that start with /EPiServer/CMS/
Disallow: /Util/                      # do not crawl any URLs that start with /Util/ 
Disallow: /about/art-in-parliament/   # do not crawl any URLs that start with /about/art-in-parliament/

https://www.parliament.uk/robots.txt
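Instead of reading robots.txt by hand, you can also check it programmatically. A sketch using the robotstxt package (not loaded in the session above, so this assumes it is installed):

```r
library(robotstxt)

# download and print the rules for a domain
get_robotstxt("www.parliament.uk")

# check whether a specific path may be crawled
paths_allowed(
  paths = "/about/art-in-parliament/",
  domain = "www.parliament.uk"
)
```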

Ethical

  • Are there other means available to get to the data (e.g., via an API)?
  • robots.txt might not be legally binding, but it is not nice to ignore it
  • Scraping can put a heavy load on a website (if you make 1000s of requests), which costs the hosts money and might bring a site down (in effect a DDoS attack)
  • Think twice before scraping personal data. You should ask yourself:
    • is it necessary for your research?
    • are you harming anyone by obtaining (or distributing) the data?
    • do you really need everything or are parts of the data sufficient (e.g., can you preselect cases or ignore variables)?
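One practical way to keep the load on a server low is to throttle and retry your requests. A sketch using httr2 (which is attached in this session); the URL, rate, and user-agent string are just example values:

```r
library(httr2)

resp <- request("https://example.com") |>
  req_throttle(rate = 30 / 60) |>          # at most 30 requests per minute
  req_retry(max_tries = 3) |>              # back off and retry on transient failures
  req_user_agent("my-research-project") |> # identify yourself to the host
  req_perform()
```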

Advice?

Legal and ethical advice is rare and complicated to give. A good opinion piece on the topic is Freelon (2018). It is worth reading in full, but it can be summarised in three general pieces of advice:

  • use authorized methods whenever possible
  • do not confuse terms of service compliance with data protection
  • understand the risks of violating terms of service

Wrap Up

Save some information about the session for reproducibility.

Show Session Info
sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] traktok_0.0.4.9000  cookiemonster_0.0.3 httr2_1.0.1        
 [4] lubridate_1.9.3     forcats_1.0.0       stringr_1.5.1      
 [7] dplyr_1.1.4         purrr_1.0.2         readr_2.1.5        
[10] tidyr_1.3.1         tibble_3.2.1        ggplot2_3.4.4      
[13] tidyverse_2.0.0     rvest_1.0.4        

loaded via a namespace (and not attached):
 [1] paperboy_0.0.5.9000 gtable_0.3.4        xfun_0.43          
 [4] websocket_1.4.1     processx_3.8.3      callr_3.7.3        
 [7] tzdb_0.4.0          vctrs_0.6.5         tools_4.3.3        
[10] ps_1.7.6            generics_0.1.3      curl_5.2.1         
[13] fansi_1.0.6         adaR_0.3.1          pkgconfig_2.0.3    
[16] lifecycle_1.0.4     compiler_4.3.3      munsell_0.5.0      
[19] chromote_0.1.2      codetools_0.2-19    htmltools_0.5.8.1  
[22] yaml_2.3.8          later_1.3.2         pillar_1.9.0       
[25] openssl_2.1.2       tidyselect_1.2.0    digest_0.6.35      
[28] stringi_1.8.3       fastmap_1.1.1       grid_4.3.3         
[31] colorspace_2.1-0    cli_3.6.2           magrittr_2.0.3     
[34] triebeard_0.4.1     utf8_1.2.4          withr_3.0.0        
[37] scales_1.3.0        promises_1.2.1      rappdirs_0.3.3     
[40] timechange_0.3.0    rmarkdown_2.26      httr_1.4.7         
[43] askpass_1.2.0       hms_1.1.3           evaluate_0.23      
[46] knitr_1.46          rlang_1.1.3         urltools_1.7.3     
[49] Rcpp_1.0.12         docopt_0.7.1        glue_1.7.0         
[52] selectr_0.4-2       xml2_1.3.6          rstudioapi_0.15.0  
[55] jsonlite_1.8.8      R6_2.5.1           

References

Freelon, Deen. 2018. “Computational Research in the Post-API Age.” Political Communication 35 (4): 665–68. https://doi.org/10.1080/10584609.2018.1477506.
Luscombe, Alex, Kevin Dick, and Kevin Walby. 2022. “Algorithmic Thinking in the Public Interest: Navigating Technical, Legal, and Ethical Hurdles to Web Scraping in the Social Sciences.” Quality & Quantity 56 (3): 1023–44. https://doi.org/10.1007/s11135-021-01164-0.